Contents

1 Introduction

This material is based on chapter 3 of the Modern Statistics for Modern Biology book by Susan Holmes and Wolfgang Huber.

We want to visualise data to

Our learning objectives are

2 Base plotting

The default graphics system that comes with R, often called base R graphics is simple and fast. It is based on the painter’s model or canvas, where different output are directly overlaid on top of each other.

Below, we display the relation between the optical density of the deoxyribonuclease (DNase) protein as measure by an enzyme-linked immunosorbent assay (ELISA) assay for all observations.

head(DNase)
##   Run       conc density
## 1   1 0.04882812   0.017
## 2   1 0.04882812   0.018
## 3   1 0.19531250   0.121
## 4   1 0.19531250   0.124
## 5   1 0.39062500   0.206
## 6   1 0.39062500   0.215
plot(DNase$con, DNase$density)
The default base `plot` function on the `DNase` data.

Figure 1: The default base plot function on the DNase data

We can add some features on the plot, such vertical dotted lines for all observed observations and customise the look and feel of the plot by setting specific arguments to the plot function.

plot(DNase$con, DNase$density,
     xlab = "DNase concentration (ng/ml)",
     ylab = "Optical density",
     pch = 1,
     col = "steelblue")
abline(v = unique(DNase$conc), lty = "dotted")
Customising a base figure using function arguments and overlaying new graphical features.

Figure 2: Customising a base figure using function arguments and overlaying new graphical features

If we wanted to change anything to that figures, we would need to repeat all the commands and modify accordingly. Any additinal command would be added to the existing canvas.

Exercise: How would you produce a figure that differentiates the different runs using base graphics?

Exercise: use the hist and boxplot functions to produce and histogram of all optical densities and a boxplot of the densities split by run.

The base graphics function are very effective to quickly produce out of the box figures. However, there is not global overview and parametrisation of the visualisation. The layout decisions have to be made up upfront (and if not adequate, the figure needs to be redrawn) and every aspect of the figure is customised locally as function arguments.

More generally, base graphics functions will work with various inputs: above we have worked with a data.frame, vectors and a formula. There is no unified type of data across all functions which makes it efficient for some types of data (if they match), but also very heterogeneous in terms of interface, leading to a lot of customisation code.

Finally, defaults, and colours in particular, are poorly chosen.

3 The ggplot2 package

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. The theoretical foundation that supports the ggplot2 is the Grammar of Graphics1 Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Berlin, Heidelberg: Springer-Verlag.. Instead of producing the figure, the user defines and assembles the visual components into an object that is the displayed. There is a book about ggplot22 Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. that provides a good overview, but it is outdated. The ggplot2 web page (https://ggplot2.tidyverse.org) provides ample documentation.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

We need first to load the ggplot2 package:

library("ggplot2")
ggplot(data = DNase)
We have only specified the data, and there's nothing to display yet.

Figure 3: We have only specified the data, and there’s nothing to display yet

ggplot(data = DNase,
       mapping = aes(x = conc, y = density))
`ggplot2` can now generate the axes, ticks and ranges based on the data.

Figure 4: ggplot2 can now generate the axes, ticks and ranges based on the data

ggplot(data = DNase,
       mapping = aes(x = conc, y = density)) + 
  geom_point()
Final figures with rendering of the data as a scatter plot.

Figure 5: Final figures with rendering of the data as a scatter plot

Exercise: compare the ggplot2 and base graphics version of the density vs. concentration plot. Which one do you prefer, and why?

It is possible to store the output of the ggplot function into a variable that can be visualised by either typing its name in the console or explicitly printing it (like any other variable).

gg <- ggplot(data = DNase,
             mapping = aes(x = conc, y = density)) + 
    geom_point()
print(gg)
Saving and printing an object.

Figure 6: Saving and printing an object

Let’s immediately customise this visualisation to - highlight how to re-use the gg object without repeating the plotting code and - how we can add additional (identical or different) geoms to a plot.

gg + geom_point(aes(colour = Run))
Adding another `geom_point` with its own (local) aesthetics.

Figure 7: Adding another geom_point with its own (local) aesthetics

Exercise: What do you think of the colours used to differentiate the different runs above?

Below is an example of more elaborated composition, overlaying points and a non-linear loess regression. But first, let’s load a object containing microarray data from the hiiragi2013.rda file. The data is originally available from the Hiiragi2013 package and Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages paper by Y. Ohnishi et al. Nature Cell Biology (2014) 16(1): 27-37. doi:10.1038/ncb2881.

Below, we combine the (transposed) expression values (extracted from the hiiragi2013 object with exprs) and sample metadata (ext-raced from the hiiragi2013 object with pData)

load("../Raw_Data/hiiragi2013.rda")
library("Biobase")
dftx <- data.frame(t(Biobase::exprs(hiiragi2013)), pData(hiiragi2013))
# or dftx<- as.data.frame(hiiragi2013)
dftx[1:10, 1:3]
##          X1415670_at X1415671_at X1415672_at
## 1 E3.25     4.910459    9.768979   10.411893
## 2 E3.25     7.526672    9.144228   10.918942
## 3 E3.25     6.956328    9.295010    9.495738
## 4 E3.25     6.424048   11.059831   10.317996
## 5 E3.25     5.105808    9.376749   11.143684
## 6 E3.25     5.856685    9.681017   10.234943
## 7 E3.25     5.059961    7.665886   10.642970
## 8 E3.25     4.574661    9.325743    9.304958
## 9 E3.25     8.123073    9.724729   11.037632
## 10 E3.25    5.464257    9.389818    9.754123
dftx[1:10, 45105:45109]
##          lineage genotype   ScanDate sampleGroup sampleColour
## 1 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 2 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 3 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 4 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 5 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 6 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 7 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 8 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 9 E3.25                WT 2011-03-16       E3.25      #CAB2D6
## 10 E3.25               WT 2011-03-16       E3.25      #CAB2D6
ggplot(dftx, aes(x = X1426642_at, y = X1418765_at)) +
    geom_point(shape = 1) +
    geom_smooth(method = "loess")
Modelling the relation between the expression of X1426642_at and X1418765_at.

Figure 8: Modelling the relation between the expression of X1426642_at and X1418765_at

And, adding colours representing the different samples

ggplot(dftx, aes(x = X1426642_at, y = X1418765_at))  +
  geom_point(aes(color = sampleColour), shape = 19) +
  geom_smooth(method = "loess")
Modelling the relation between the expression of X1426642_at and X1418765_at and annotating samples.

Figure 9: Modelling the relation between the expression of X1426642_at and X1418765_at and annotating samples

View(pData(hiiragi2013))
# Biobase::exprs(hiiragi2013)

4 Visualising 1D data

Let’s start by exploring some 1 dimensional visualisation. This is very relevant for omics data such as transcriptomics or quantitative proteomics, when contrasting the expression values across multiple samples.

First, we convert a microarray gene expression data to a data.frame, fit for some ggplot2 visualisation, focusing on genes Fgf4, Gata4, Gata6 and Sox2.

selectedProbes <- c(Fgf4 = "1420085_at", Gata4 = "1418863_at",
                    Gata6 = "1425463_at",  Sox2 = "1416967_at")
library("dplyr")
library("tidyr")
tmp <- data.frame(t(exprs(hiiragi2013[selectedProbes, ])))
names(tmp) <- names(selectedProbes)
tmp$sample <- rownames(tmp)
head(tmp)
##              Fgf4    Gata4    Gata6     Sox2  sample
## 1 E3.25  3.027715 4.843137 5.500618 1.731217 1 E3.25
## 2 E3.25  9.293016 5.530016 6.160900 9.697038 2 E3.25
## 3 E3.25  2.940142 4.418059 4.584961 4.161240 3 E3.25
## 4 E3.25  9.715243 5.982314 4.753439 9.540123 4 E3.25
## 5 E3.25  8.924228 4.923580 4.629728 8.705340 5 E3.25
## 6 E3.25 11.325952 4.068520 4.165692 8.696228 6 E3.25
genes <- gather(tmp, key = "gene", value = "expression", -sample)
head(genes)
##    sample gene expression
## 1 1 E3.25 Fgf4   3.027715
## 2 2 E3.25 Fgf4   9.293016
## 3 3 E3.25 Fgf4   2.940142
## 4 4 E3.25 Fgf4   9.715243
## 5 5 E3.25 Fgf4   8.924228
## 6 6 E3.25 Fgf4  11.325952

4.1 Histogram

genes %>%
    filter(gene == "Gata4") %>%
    ggplot(aes(x = expression)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Distribution of the Gata4 expression

Figure 10: Distribution of the Gata4 expression

4.2 Box plots

p <- ggplot(genes, aes(x = gene, y = expression, fill = gene))
bxplot <- p + geom_boxplot()
bxplot
A boxplot of expression values.

Figure 11: A boxplot of expression values

Exercise: Repeat the above figure replacing the boxes by violins using the geom_violin. Which one do you think does a better job?

4.3 Dot plots and beeswarm plots

When the data aren’t too large, it is also possibly to visualise all points to get a sense of their distribution.

jtrplot <- p +
    geom_jitter(aes(colour = gene)) +
    theme(legend.position = "none")

In a dotplot, the position of the points along the y axis is discretised into bins (set as 1/6 below) and the points are then stacked next to each other.

dotplot <- p + geom_dotplot(binaxis = "y", binwidth = 1/6,
                            stackdir = "center", stackratio = 0.75,
                            aes(color = gene)) +
    theme(legend.position = "none")

The beeswarm algorithms tries to avoid overlapping points.

library("ggbeeswarm")
beeplot <- p + geom_beeswarm(aes(color = gene)) + 
    theme(legend.position = "none")
library(patchwork)
jtrplot + dotplot + beeplot
Showing all expression values using jittering (left), a dotplot (centre) and a beeswarn plot.

Figure 12: Showing all expression values using jittering (left), a dotplot (centre) and a beeswarn plot

4.4 Density and ecdf plots

densplot <- ggplot(genes, aes(x = expression, color = gene)) +
    geom_density() +
    theme(legend.position = "none")
ecdfplot <- ggplot(genes, aes(x = expression, color = gene)) +
    stat_ecdf() +
    theme(legend.position = "none")
densplot + ecdfplot
Density and cumulative density functions of expression values.

Figure 13: Density and cumulative density functions of expression values

4.5 Summary

  • Boxplot makes sense for unimodal distributions (see below).

  • Histogram requires definition of bins (width, positions) and can create visual artefacts especially if the number of data points is not large.

  • Density requires the choice of bandwidth; obscures the sample size (i.e. the uncertainty of the estimate).

  • ecdf does not have these problems; but is more abstract and interpretation requires more training. Good for reading off quantiles and shifts in location in comparative plots.

  • beeswarm: for up to a few dozens of points, just show the data.

The number of modes of a distribution depends on scale transformation of the data.

sim <- data.frame(x = exp(rnorm(n = 1e5,
                                mean = sample(c(2, 5),
                                              size = 1e5,
                                              replace = TRUE))))

p1 <-  ggplot(sim, aes(x)) +
    geom_histogram(binwidth = 10, boundary = 0) +
    xlim(0, 400)
p2 <-  ggplot(sim, aes(log(x))) +
    geom_histogram(bins = 30)
p1 + p2
## Warning: Removed 7913 rows containing non-finite values (stat_bin).
Histograms of the same data without (left) and with (right) log-transformation.

Figure 14: Histograms of the same data without (left) and with (right) log-transformation

This also applies to density plots.

5 Visualising 2D data

dfx <- as.data.frame(Biobase::exprs(hiiragi2013))
scp <- ggplot(dfx, aes(x= `59 E4.5 (PE)`, 
                       y = `92 E4.5 (FGF4-KO)`)) 
scp + geom_point()
Scatter plot comparing the expression of a wild-type vs. FGF4 KO.

Figure 15: Scatter plot comparing the expression of a wild-type vs
 FGF4 KO.

Exercise: The over-plotting of the dots stops us from learning anything about the density of the different regions of the plot. Use the alpha parameter to geom_points between 0 (full transparency) to 1 (opaque, default).

scp + geom_density2d(h = 0.5, bins = 60)
Focusing on contours rather that individual values.

Figure 16: Focusing on contours rather that individual values

scp + geom_hex() 
Local density summaries.

Figure 17: Local density summaries

6 Visualising data along more dimensions

When visualising data along additional dimension, we can parameterise the points by setting their shape, colour, size and transparency, that can be set with point aesthetics such as fill, color (or colour), shape, size and alpha.

A very powerful way to represent data along additional dimensions is facetting, i.e. producing sub-plots for different subsets of the data. Below, we first re-annotate the data using some regular expressions

p1 <- ggplot(dftx, aes(x = X1426642_at, y = X1418765_at, colour = lineage)) +
  geom_point()
p2 <- ggplot(dftx, aes(x = X1426642_at, y = X1418765_at)) +
  geom_point() +
    facet_grid( . ~ lineage )
p1 + p2
Different sub-plots for different lineages using colours (left) of facets(right) to distinguish the different lineages.

Figure 18: Different sub-plots for different lineages using colours (left) of facets(right) to distinguish the different lineages

ggplot(dftx,
       aes(x = X1426642_at, y = X1418765_at)) +
    geom_point() +
    facet_grid( Embryonic.day ~ lineage )
Different sub-plots for different lineages and embryonic stages.

Figure 19: Different sub-plots for different lineages and embryonic stages

Exercise: Use facets to visualise the distribution of the four Fgf4, Gata4, Gata6 and Sox2 genes in the genes data using histograms.

7 Interactive visualisation

library("plotly")

scp <- ggplot(dfx[1:100, ],
              aes(x= `59 E4.5 (PE)`, y = `92 E4.5 (FGF4-KO)`))

scp2 <- scp + geom_point()
ggplotly(scp2)

Figure 20: Interactive visualisation with plotly

See https://plot.ly/r for examples of interactive graphics online.

Session info

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252   
## [3] LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
## [5] LC_TIME=French_Belgium.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_0.0.1     ggbeeswarm_0.6.0    tidyr_0.8.3        
##  [4] plotly_4.9.0        Hmisc_4.2-0         ggplot2_3.2.0      
##  [7] Formula_1.2-3       survival_2.44-1.1   lattice_0.20-38    
## [10] dplyr_0.8.2         Biobase_2.44.0      BiocGenerics_0.30.0
## [13] BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1          assertthat_0.2.1    digest_0.6.19      
##  [4] mime_0.7            plyr_1.8.4          R6_2.4.0           
##  [7] backports_1.1.4     acepack_1.4.1       evaluate_0.14      
## [10] httr_1.4.0          highr_0.8           pillar_1.4.2       
## [13] rlang_0.4.0         lazyeval_0.2.2      rstudioapi_0.10    
## [16] data.table_1.12.2   hexbin_1.27.3       rpart_4.1-15       
## [19] Matrix_1.2-17       checkmate_1.9.3     rmarkdown_1.13     
## [22] labeling_0.3        splines_3.6.0       stringr_1.4.0      
## [25] foreign_0.8-71      htmlwidgets_1.3     munsell_0.5.0      
## [28] shiny_1.3.2         httpuv_1.5.1        compiler_3.6.0     
## [31] vipor_0.4.5         xfun_0.8            pkgconfig_2.0.2    
## [34] base64enc_0.1-3     htmltools_0.3.6     nnet_7.3-12        
## [37] tidyselect_0.2.5    tibble_2.1.3        gridExtra_2.3      
## [40] htmlTable_1.13.1    bookdown_0.11       viridisLite_0.3.0  
## [43] later_0.8.0         crayon_1.3.4        withr_2.1.2        
## [46] MASS_7.3-51.4       grid_3.6.0          xtable_1.8-4       
## [49] jsonlite_1.6        gtable_0.3.0        magrittr_1.5       
## [52] scales_1.0.0        stringi_1.4.3       reshape2_1.4.3     
## [55] promises_1.0.1      latticeExtra_0.6-28 RColorBrewer_1.1-2 
## [58] tools_3.6.0         glue_1.3.1          beeswarm_0.2.3     
## [61] purrr_0.3.2         crosstalk_1.0.0     yaml_2.2.0         
## [64] colorspace_1.4-1    cluster_2.1.0       BiocManager_1.30.4 
## [67] knitr_1.23